Goto

Collaborating Authors

 adversarial suffix



Poems Can Trick AI Into Helping You Make a Nuclear Weapon

WIRED

It turns out all the guardrails in the world won't protect a chatbot from meter and rhyme. You can get ChatGPT to help you build a nuclear bomb if you simply design the prompt in the form of a poem, according to a new study from researchers in Europe. The study, Adversarial Poetry as a Universal Single-Turn Jailbreak in Large Language Models (LLMs)," comes from Icaro Lab, a collaboration of researchers at Sapienza University in Rome and the DexAI think tank. According to the research, AI chatbots will dish on topics like nuclear weapons, child sex abuse material, and malware so long as users phrase the question in the form of a poem. "Poetic framing achieved an average jailbreak success rate of 62 percent for hand-crafted poems and approximately 43 percent for meta-prompt conversions," the study said. The researchers tested the poetic method on 25 chatbots made by companies like OpenAI, Meta, and Anthropic . It worked, with varying degrees of success, on all of them. WIRED reached out to Meta, Anthropic, and OpenAI for a comment but didn't hear back. The researchers say they've reached out as well to share their results. AI tools like Claude and ChatGPT have guardrails that prevent them from answering questions about "revenge porn" and the creation of weapons-grade plutonium. But it's easy to confuse those guardrails by adding " adversarial suffixes " to a prompt. Basically, add a bunch of extra junk to a question and it confuses the AI and bypasses its safety systems. The poetry jailbreak is similar. "If adversarial suffixes are, in the model's eyes, a kind of involuntary poetry, then real human poetry might be a natural adversarial suffix," the team at Icaro Lab, the researchers behind the poetry jailbreak, tell WIRED. "We experimented by reformulating dangerous requests in poetic form, using metaphors, fragmented syntax, oblique references.



LatentBreak: Jailbreaking Large Language Models through Latent Space Feedback

Mura, Raffaele, Piras, Giorgio, Lukošiūtė, Kamilė, Pintor, Maura, Karbasi, Amin, Biggio, Battista

arXiv.org Artificial Intelligence

Jailbreaks are adversarial attacks designed to bypass the built-in safety mechanisms of large language models. Automated jailbreaks typically optimize an adversarial suffix or adapt long prompt templates by forcing the model to generate the initial part of a restricted or harmful response. In this work, we show that existing jailbreak attacks that leverage such mechanisms to unlock the model response can be detected by a straightforward perplexity-based filtering on the input prompt. To overcome this issue, we propose LatentBreak, a white-box jailbreak attack that generates natural adversarial prompts with low perplexity capable of evading such defenses. LatentBreak substitutes words in the input prompt with semantically-equivalent ones, preserving the initial intent of the prompt, instead of adding high-perplexity adversarial suffixes or long templates. These words are chosen by minimizing the distance in the latent space between the representation of the adversarial prompt and that of harmless requests. Our extensive evaluation shows that LatentBreak leads to shorter and low-perplexity prompts, thus outperforming competing jailbreak algorithms against perplexity-based filters on multiple safety-aligned models.




Adversarial Manipulation of Reasoning Models using Internal Representations

Yamaguchi, Kureha, Etheridge, Benjamin, Arditi, Andy

arXiv.org Artificial Intelligence

Reasoning models generate chain-of-thought (CoT) tokens before their final output, but how this affects their vulnerability to jailbreak attacks remains unclear. While traditional language models make refusal decisions at the prompt-response boundary, we find evidence that DeepSeek-R1-Distill-Llama-8B makes these decisions within its CoT generation. We identify a linear direction in activation space during CoT token generation that predicts whether the model will refuse or comply -- termed the "caution" direction because it corresponds to cautious reasoning patterns in the generated text. Ablating this direction from model activations increases harmful compliance, effectively jailbreaking the model. We additionally show that intervening only on CoT token activations suffices to control final outputs, and that incorporating this direction into prompt-based attacks improves success rates. Our findings suggest that the chain-of-thought itself is a promising new target for adversarial manipulation in reasoning models. Code available at https://github.com/ky295/reasoning-manipulation.


Universal and Transferable Adversarial Attack on Large Language Models Using Exponentiated Gradient Descent

Biswas, Sajib, Nishino, Mao, Chacko, Samuel Jacob, Liu, Xiuwen

arXiv.org Artificial Intelligence

As large language models (LLMs) are increasingly deployed in critical applications, ensuring their robustness and safety alignment remains a major challenge. Despite the overall success of alignment techniques such as reinforcement learning from human feedback (RLHF) on typical prompts, LLMs remain vulnerable to jailbreak attacks enabled by crafted adversarial triggers appended to user prompts. Most existing jailbreak methods either rely on inefficient searches over discrete token spaces or direct optimization of continuous embeddings. While continuous embeddings can be given directly to selected open-source models as input, doing so is not feasible for proprietary models. On the other hand, projecting these embeddings back into valid discrete tokens introduces additional complexity and often reduces attack effectiveness. We propose an intrinsic optimization method which directly optimizes relaxed one-hot encodings of the adversarial suffix tokens using exponentiated gradient descent coupled with Bregman projection, ensuring that the optimized one-hot encoding of each token always remains within the probability simplex. We provide theoretical proof of convergence for our proposed method and implement an efficient algorithm that effectively jailbreaks several widely used LLMs. Our method achieves higher success rates and faster convergence compared to three state-of-the-art baselines, evaluated on five open-source LLMs and four adversarial behavior datasets curated for evaluating jailbreak methods. In addition to individual prompt attacks, we also generate universal adversarial suffixes effective across multiple prompts and demonstrate transferability of optimized suffixes to different LLMs.


Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses

Ahmed, Mohamed, Abdelmouty, Mohamed, Kim, Mingyu, Kandula, Gunvanth, Park, Alex, Davis, James C.

arXiv.org Artificial Intelligence

--The advancement of Pre-Trained Language Models (PTLMs) and Large Language Models (LLMs) has led to their widespread adoption across diverse applications. Despite their success, these models remain vulnerable to attacks that exploit their inherent weaknesses to bypass safety measures. Two primary inference-phase threats are token-level and prompt-level jailbreaks. T oken-level attacks embed adversarial sequences that transfer well to black-box models like GPT but leave detectable patterns and rely on gradient-based token optimization, whereas prompt-level attacks use semantically structured inputs to elicit harmful responses yet depend on iterative feedback that can be unreliable. T o address the complementary limitations of these methods, we propose two hybrid approaches that integrate token-and prompt-level techniques to enhance jailbreak effectiveness across diverse PTLMs. GCG + PAIR and the newly explored GCG + WordGame hybrids were evaluated across multiple Vicuna and Llama models. GCG + PAIR consistently raised attack-success rates over its constituent techniques on undefended models; for instance, on Llama-3, its Attack Success Rate (ASR) reached 91.6%, a substantial increase from PAIR's 58.4% baseline. Meanwhile, GCG + WordGame matched the raw performance of WordGame maintaining a high ASR of over 80% even under stricter evaluators like Mistral-Sorry-Bench. Crucially, both hybrids retained transferability and reliably pierced advanced defenses such as Gradient Cuff and JBShield, which fully blocked single-mode attacks. These findings expose previously unreported vulnerabilities in current safety stacks, highlight trade-offs between raw success and defensive robustness, and underscore the need for holistic safeguards against adaptive adversaries. Large Language Models (LLMs)--such as GPT -4, LLaMA, and Claude--have become indispensable in healthcare, finance, education, and other high-stakes domains [1]-[3]. Their ability to understand context, generate human-like responses, and adapt to diverse tasks fuels widespread deployment. Y et these same models remain vulnerable to jailbreak attacks, which exploit weaknesses in alignment mechanisms to induce harmful or disallowed content [4].


Excessive Reasoning Attack on Reasoning LLMs

Si, Wai Man, Li, Mingjie, Backes, Michael, Zhang, Yang

arXiv.org Artificial Intelligence

However, prior studies have shown that these models often incur significant computational costs due to excessive reasoning, such as frequent switching between reasoning trajectories (e.g., underthinking) or redundant reasoning on simple questions (e.g., overthinking). In this work, we expose a novel threat: adversarial inputs can be crafted to exploit excessive reasoning behaviors and substantially increase computational overhead without compromising model utility. Therefore, we propose a novel loss framework consisting of three components: (1) Priority Cross-Entropy Loss, a modification of the standard cross-entropy objective that emphasizes key tokens by leveraging the autoregressive nature of LMs; (2) Excessive Reasoning Loss, which encourages the model to initiate additional reasoning paths during inference; and (3) Delayed Termination Loss, which is designed to extend the reasoning process and defer the generation of final outputs. We optimize and evaluate our attack for the GSM8K and ORCA datasets on DeepSeek-R1-Distill-LLaMA and DeepSeek-R1-Distill-Qwen. Empirical results demonstrate a 3x to 9x increase in reasoning length with comparable utility performance. Furthermore, our crafted adversarial inputs exhibit transferability, inducing computational overhead in o3-mini, o1-mini, DeepSeek-R1, and QWQ models.